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ABSTRACT 

We present a comprehensive map of over 1 million 
polyadenylation sites and quantify their usage in 
major cancers and tumor cell lines using direct RNA 
sequencing. We built the Expression and 
Polyadenylation Database to enable the visualization 
of the polyadenylation maps in various cancers and to 
facilitate the discovery of novel genes and gene 
isoforms that are potentially important to tumorigen- 
esis. Analyses of polyadenylation sites indicate that a 
large fraction (~30%) of mRNAs contain alternative 
polyadenylation sites in their 3 untranslated regions, 
independent of the cell type. The shortest 3' untrans- 
lated region isoforms are preferentially upregulated in 
cancer tissues, genome-wide. Candidate targets of 
alternative polyadenylation-mediated upregulation 
of short isoforms include POLR2K, and signaling 
cascades of cell-cell and cell-extracellular matrix 
contact, particularly involving regulators of Rho 
GTPases. Polyadenylation maps also helped to 
improve 3' untranslated region annotations and 
identify candidate regulatory marks such as 
sequence motifs, H3K36Me3 and Pabpd that are 
isoform dependent and occur in a position-specific 
manner. In summary, these results highlight the 
need to go beyond monitoring only the cumulative 
transcript levels for a gene, to separately analysing 
the expression of its RNA isoforms. 

INTRODUCTION 

Alterations in 3 f untranslated regions (UTRs) that drive 
post-transcriptional control of gene expression is a 



common theme in multiple diseases including cancer (1-3). 
Recent reports also provide surprising and compelling 
evidence that in proliferating/cancer cells, genes often 
switch their expression toward the short 3 f UTR isoforms 
that correspond to truncated versions of the canonical long 
isoforms (4,5). However, 3 r UTR isoforms are poorly 
annotated in databases (6) because common high- 
throughput technologies such as microarrays, RNA-Seq 
and quantitative reverse transcriptase-polymerase chain 
reaction cannot readily distinguish between 3 f UTR 
isoforms. This limitation arises because the long 3' UTR 
isoforms encompass the short isoforms, and therefore the 
short isoform- specific probes/primers hybridize to both 
short and long isoforms, leading to inseparable, mixed 
signals (5,7). Such limitations in precisely identifying and 
quantifying 3 f UTR variants can lead to erroneous conclu- 
sions in not only gene expression studies but also in investi- 
gations of posttranscriptional events (4,5). New approaches 
such as Direct RNA sequencing or DRS (8), 3P-Seq (6), 
PAS-Seq (9) and others (10,11) are beginning to be 
adopted to identify 3' UTR variants, allowing the construc- 
tion of a near-complete map of 3 f UTR variants of the 
human genome (9,12). 

Polyadenylation can generate many alternative tran- 
scripts of a given gene, with important consequences. 
Alternative polyadenylation (APA) can lead to truncated 
protein isoforms, abrogate protein-coding capacity, affect 
transcript stability, alter translation efficiency and affect 
transcript export (13). Important oncogenes such as p53 
and CCND1 are known to have altered polyadenylation 
that results from simple 3' UTR mutations (14,15). A 
related emerging theme in cancer biology is that APA 
within the same 3 f UTR can enhance production of onco- 
genes (5) because shorter 3' UTR isoforms have higher 
translational efficiency than their respective long 
isoforms. Although APA within the same 3 f UTR simply 
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results in mRNA isoforms that code for identical proteins, 
it can alter miRNA targeting regions (5), subcellular local- 
izations and stability (16), and protein production rate (13). 
Thus, tandem APA seems to provide an alternative 
mechanism to complement more subtle posttranscriptional 
regulatory modes such as miRNAs that can cause either 
translational arrest (17) or destabilize target RNAs 
(18,19) in human cells (20). The existence of a mechanism 
that preferentially alters the relative expression ratio of long 
and short isoforms is also observed in embryonic cells 
(9,21). The diversity of transcripts produced via polyadeny- 
lation and its consequences underscore the need to accur- 
ately catalog and study polyadenylation in both normal 
and diseased cells. 

We sought to build a comprehensive polyadenylation map 
of the human genome across major cancers and their cognate 
normal tissues, which could also facilitate studies on 3' UTR 
isoforms, and enable discoveries of important novel gene 
isoforms. To reduce artifacts associated with common 
deep sequencing strategies, we used DRS to construct the 
polyadenylation landscape and measure gene expression 
directly without manipulating purified RNA (8). We built 
a public resource for tissue-specific polyadenylation sites 
termed Expression and Polyadenylation .Database 
(xPAD), which represents a comprehensive analysis and dis- 
covery platform for the investigation of more than 1 million 
polyadenylation sites, 3' UTRs and their cellular usage, 
across five major organs, their cognate tumor samples, and 
six different cell types including commonly used ENCODE 
(22) cell types (Hela-S3, HepG2, K562 and MCF7). 
Approximately 9000 polyadenylation sites are well expressed 
(> 10 reads) and are at least 1000 nucleotides (nts) away from 
annotated gene regions, indicating the presence of many 
novel polyadenylated gene isoforms and genes. DRS reads 
also reveal potential polyadenylation in many non-coding 
RNAs including the IncRNA (23) GAS 5 that encodes 10 
snoRNAs, which are all unexpectedly polyadenylated. The 
simultaneous determination of gene isoforms and their levels 
also opens up additional opportunities in exploring the data, 
such as using network and motif analysis for studying gene 
isoforms. Indeed, network analysis of genes that manifest 
APA suggest that cell-cell and cell-extracellular matrix 
(ECM) contact mediated signaling cascades are potentially 
key targets of APA in cancer, particularly involving regula- 
tors of Rho family small GTPases that are consistently 
targeted by these mechanisms across the samples tested. 
The nucleotide level resolution of the polyadenylation 
maps also enabled the discovery of novel sequence motifs 
and regulatory marks (e.g. H3K36Me3) that are highly pos- 
ition specific with respect to polyadenylation sites and are 
isoform dependent. In summary, the polyadenylation site 
maps and its usage reported here reveal an extensive land- 
scape of both known and novel polyadenylated genes, their 
isoforms, and their regulation. 

MATERIALS AND METHODS 

DRS sequencing and genome mapping 

Matched pair (normal/tumor) total RNA from tissues was 
used for DRS (8,12); all paired RNA samples for 



sequencing were purchased from Biochain (Hayward, 
CA), except for Breast (Asterand, MI). For DRS 
analysis, the 3' blocking reaction was performed with 
3'deoxyATP (Jena Biosciences, Germany) by incubating 
the reaction mixture at 37°C for 30min. Raw DRS reads 
were quality filtered using HeliSphere (http://open 
.helicosbio.com) and in-house tools (DocSl) to remove 
low-quality sequences and reads shorter than 25 nts. The 
filtered reads were mapped to the human genome 
(GRCh37) with maximum allowed error of 10% 
(mismatches and indels) by MOSAIK (24) with conserva- 
tive parameters (-mmp 0.1 -mhp 100 -act 20 -hs 15 -p 8 - 
bw 13). Although internal priming in DRS data is negli- 
gible, we implemented filters for quality control, following 
a more stringent criterion than the recommended removal 
of sequences with a stretch of eight consecutive adenosines 
(25). For quality control, uniquely mapped reads were 
further filtered to remove all sequences that genomically 
contained either six consecutive Adenosines or at least 
seven Adenosines within 10 nts downstream of the end 
of the mapped DRS reads. 

The mapped locations were annotated using the UCSC 
genome browser tables (26). When a locus could be 
attributed to multiple possible annotations, the locus 
was assigned with a single annotation in the following 
priority order: 3' UTRs (sense), coding sequences (CDS, 
sense), 5' UTRs (sense), intron (sense), non-coding RNAs 
(ncRNAs, sense), 5' UTR antisense, CDS antisense, 3' 
UTR antisense, intron antisense, promoter antisense, 
ncRNA antisense and intergenic. Unlike other regions, 
intergenic regions are not separated into sense and anti- 
sense strands. Promoter regions were defined as regions 
1000 nts upstream of transcription start sites provided 
by UCSC genome browser tables. DRS reads mapping 
to regions that are at least 5000 nts away from the 
closest gene region (sense/antisense) were annotated as 
intergenic. 

Because polyadenylation sites vary by a few nucleotides in 
a given isoform, we used the snow-ball method to define 
polyadenylation sites (27). The snow-ball method iteratively 
clusters genomic locations that are located within 24 nts, 
using the genomically mapped 5' positions of the DRS 
reads as a reference. For analysis of genes containing 
tandem polyadenylation sites, polyadenylation sites with 
less than 10 total reads in the combined set of normal and 
tumor reads, were removed. The proximal polyadenylation 
site was defined as the first polyadenylation site after the end 
of the CDS on the 3' UTR. The distal polyadenylation site 
was defined as the last polyadenylation site after the end of 
CDS on the 3' UTR. If a gene contains more than two 
tandem polyadenylation sites on the 3' UTR, only the first 
and last sites are used for analysis (5). To account for vari- 
ations in sample loading, the number of reads from each 
polyadenylation site was standardized to contain an equal 
number of reads across all tissues, using the normal breast 
tissue reads as a reference. To enable more accurate com- 
parisons, the resulting expression levels were further 
quantile-normalized (R package) between normal and 
tumor within each tissue. To analyse whether quantile nor- 
malizations of the samples are justified, Quantile-quantile 
plots of raw read numbers (log2 scale) between tumor and 
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normal samples were analysed, which yielded high correl- 
ation (R = 0.98—0.99) along the diagonal. The distance 
between adjacent polyadenylation sites were defined as the 
distance between their 5' most cleavage sites. To determine 
variances in polyadenylation site locations, the mean 
location of each polyadenylation cluster (>10 reads) was 
determined. For a given cluster, polyadenylation sites 
corresponding to every read was used to determine the 
mean location of the polyadenylation site. For each 
polyadenylation site cluster, the distance between the 
polyadenylation site of each read and the mean location 
was used to calculate variance. 

Motif-enrichment analysis 

The total number of genes used for motif-enrichment 
analysis is identical for both short and long isoforms. 
Because some of the polyadenylation sites are tissue- 
dependent, motif analysis was constrained to short and 
long isoforms detected in a single tissue-type (normal 
breast). To ensure that the polyadenylation sites of short 
and long isoforms are well separated, we further con- 
strained the analysis to those sites that are at least 100 
nts apart, resulting in a total of 3270 genes. Motifs in 
regions flanking (±500 nts) the polyadenylation sites of 
the short and the long isoforms were detected based on 
the DREME (28) motif discovery algorithm, followed by 
a statistical assessment (E < 0.05) of the positional prefer- 
ence of the motifs, using CentriMo (29), a position-specific 
motif-enrichment analysis method. The analysis resulted 
in a total of 27 motifs, six of which were eliminated 
because they occur in a limited number of sequences, cor- 
responding to less than 5% of the sequences at those dis- 
tinguishing positions that are most enriched for the motifs. 
To focus on isoform-dependent motifs, the observed pref- 
erence of each motif was further tested by the 
bootstrapping approach to determine statistical signifi- 
cance, and a final non-redundant set of eight motifs 
were selected using default parameters of TOMTOM (30). 

Statistical tests 

Both the Ansari-Bradley and F tests were used to test 
whether the variances in locations of polyadenylation-sites 
in short and long isoforms are statistically different 
(MATLAB). As F- tests yielded more statistically signifi- 
cant numbers, we report only the P values obtained using 
the most conservative test (one-tailed Ansari-Bradley). 
For testing the observed increase/decrease in tumor cells 
for the absolute or relative expression of short and long 
isoforms, the non-parametric one tailed Wilcoxon signed 
rank test was used. Bootstrapping analysis for motif de- 
tection was performed by randomly sampling (with re- 
placement, n = 3270), the complete set of either short or 
long isoforms to yield 10 different datasets for each 
isoform. The resulting distribution of occurrences for 
each motif at their characteristic location in each of the 
10 datasets were calculated for both short and long 
isoforms, and compared using two-tailed student t test. 



RESULTS 

Genomic features of polyadenylation sites 

Five matched pairs (10 samples) of tumor and normal tissues 
from human breast, colon, kidney, liver and lung were used 
to obtain DRS reads using a total of 10 channels of the 
Helicos Sequencer. Both normal and tumor tissue samples 
yielded comparable number of reads. For example, a single 
channel for breast tissue generated 10 222436 quality- 
filtered reads ranging from 25 to 64 nts, of which 2 628 790 
reads remained after retaining high confidence reads (see 
'Materials and Methods'). The final reads dominantly 
mapped within ± 10 nts of known 3' ends of human genes 
expressed in normal breast tissue (Figure 1A). A similar 
pattern, in agreement with known 3' UTR annotations is 
seen across all other normal and tumor tissues, 
demonstrating that the inferred polyadenylation maps ac- 
curately identify known polyadenylation sites (Supple- 
mentary Figure SI A). The total number of genes that were 
expressed in a given cancer type (reads > 10) and contained 
tandem polyadenylation sites were approximately the same 
across organs: 4170 (breast), 4405 (colon), 4268 (lung), 3128 
(liver) and 4728 (kidney). Consistent with previous reports, 
we find that although polyadenylation sites frequently vary 
(31,32) the average absolute deviation based on all distinct 
locations within a cleavage site cluster (see Materials and 
Methods) is small (±2.7 nts), indicating that cleavage is 
highly precise. Intriguingly, perhaps due to differences in 
the factors (e.g. motifs) that govern the cleavage of the 
short and long isoforms, short isoforms manifest a statistic- 
ally significant higher variance for cleavage sites than long 
isoforms, across all tissues (Figure IB). 

In normal breast tissue alone, we identified 401 873 
distinct polyadenylation sites of which 36093 sites 
generated at least 10 reads. Among a total of 1 287 130 
polyadenylation sites that were identified across all 
tissues, 61 788 sites were sequenced multiple (>10) times 
in at least one tissue. The majority (73—90%) of the 
reads mapped to the sense strands (Figure 1C and 
Supplementary Figure SIB) of functionally important 
regions (375' UTR, coding sequence i.e. CDS, intron, 
non-coding RNA i.e. ncRNA, and promoter — see 
Materials and Methods). The observation that sense tran- 
scripts constitute the major fraction (~70%) of 
polyadenylated total RNA is fully in agreement with 
other studies (33,34). Notably, compared with other func- 
tional regions where polyadenylated antisense transcripts 
are rare, anti-sense transcripts are enriched in intronic 
regions across all 10 tissue types (Supplementary Figure 
SIB). Although most of these antisense transcripts are 
not abundant, ~17% of these transcripts occur in clusters 
separated by less than 2500 nts, suggestive of transcription 
and subsequent polyadenylation of antisense transcripts in 
intronic regions. Further analysis of intronic transcripts did 
not indicate any dependence between the length of the 
introns and the number of polyadenylation sites, but 
revealed that in comparison with terminal (5' or 3') 
introns, internal introns are significantly more enriched in 
both sense and antisense polyadenylated transcripts 
(Figure ID). The enrichment is consistently stronger for 
intronic sense transcripts than antisense transcripts. 



Nucleic Acids Research, 2012, Vol 40, No. 1 7 8463 



A 

A 








Co* 


ZD 






(/) 




-Q 

(0 


20- 


0> 


s_ 






15- 


cc 


Q 




»*— 




o 
a> 


10- 


U) 




(0 




+-< 
c 


5- 


a> 




£ 




a> 
Q_ 


0- 



B 



□ Normal Short 

□ Normal Long 

10 -3 1Q -61 



I Tumor Short 
I Tumor Long 



^60- 



30 



c 

>10 



0 J 




■ Sense 
□ Antisense 



-1000 0 1000 

Distance to 3' ends (nts) 



breast 



colon 



lung 



liver 



kidney 



^100 
n 

^ 90 
o 

3 80 



I 70 
I 60- 
w 50- 
.2 40- 
> 30 
3 20 
p 10 



□5' Antisense Q3' Antisense □ Internal antisense 
. B5' Sense ■ 3' Sense ■ Internal sense 



J Jill J 



□ 1 pA site 

□ 2 pA sites 

□ >2 pA sites 



70.4% 



2000 -| 




(/) 
Q) 
+-< 

'</> 
< 



? 1000- 



a> 
n 

E 

3 



19.2% 10.4% 




Breast 



Colon 



Kidney 



Liver 



Lung 



I 1 1 1 1 1 

0 1000 3000 5000 

Distance between pA sites (nts) 
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Although the polyadenylation site density in antisense 
internal intronic regions is lower than that of the cognate 
sense strand, it is always higher than that of terminal 
introns. The observed biases in intronic polyadenylation 
might be related to nucleosome depletion at internal 
introns (35) that may be more susceptible to aberrant 
intragenic transcription (36). Indeed, consistent with 
notion of aberrant transcription, the polyadenylation sites 
in these antisense regions follows a random pattern across 
most of the intron span, whereas the sense introns manifest 
a more biased distribution towards the 5 ; start of the 
introns (Supplementary Figure S2). 

Because tandem APA-mediated gene regulation that 
occurs without a change in the cognate protein sequence 
(37) is an emerging theme in cancer progression 
(4,5,38,39), we investigated whether polyadenylated 3 r 
UTR isoforms are commonly produced. Approximately 
30% of genes have multiple tandem polyadenylation 



sites in their 3 r UTRs in any given tissue (Figure IE and 
Supplementary Figure S3 A). Most of these genes have 
precisely two polyadenylation sites. The observed propor- 
tion is nearly identical to the previously reported percent- 
age of ^34%, determined using EST mapping (27). 
Similarly, in accordance with the EST-mapping study, 
for those genes with multiple tandem polyadenylation 
sites, the adjacent polyadenylation sites peak MOO nts 
apart (Figure IF and Supplementary Figure S3B) and 
have a median of 368 nts. Thus, the overall genomic char- 
acteristics of polyadenylation events captured by both 
DRS and ESTs are nearly identical. 

Polyadenylation patterns of non-coding and novel genes 

Non-coding RNAs such as miRNAs and IncRNAs (23) are 
frequently implicated in cancer progression (40) and in some 
cases are known to undergo polyadenylation (41,42). We 
analysed whether miRNA and IncRNA regions are 
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polyadenylated and if so whether those polyadenylated 
forms are aberrantly expressed in cancer. A modest propor- 
tion (12.3%) of the 14403 IncRNAs locations is 
polyadenylated within lOOnts of their annotated 3' ends. 
However, most of these polyadenylated locations are not 
abundant, as only a small fraction (1.7%; 244 loci) of the 
IncRNAs generates many (> 1 0) reads in at least one of the 1 0 
tissue samples tested. Abundantly polyadenylated IncRNAs 
loci include the well-known breast cancer metastasis- 
associated HOT AIR (41) that is highly (>12 fold) 
upregulated in breast tumor (49 versus. 4 reads). Although 
polyadenylated IncRNAs do not manifest a consistent 
pattern of differential expression across the majority of the 
tumors tested, the GAS 5 and TMEM191A are two good 
examples of polyadenylated IncRNA regions that are aber- 
rantly regulated in the majority of tumors tested (Figure 2 A 
and B). The 4kb long GAS5 that hosts 10 snoRNAs is fre- 
quently down-regulated gene in breast cancer (43). In DRS 
results, GAS5 is also 3-fold down-regulated in breast tumor. 
Notably, although only 36% of snoRNAs are polyaden- 
ylated in at least one of the tissues, all GAS5 snoRNA loci 
are usually polyadenylated in multiple tissues, generally 
within ~5 nts of their annotated 3' ends. Among all the 
miRNAs (939 loci) analysed, 25 (2.6%) locations are 
polyadenylated within 100 nts of the annotated 3' end of 
the annotated precursor miRNA (pre-miRNA) sequence, 
suggesting that polyadenylation among pre-miRNAs is 
rare. Among these, only three pre-miRNAs (mir-147b, 
—886, and —1975), generated at least 10 reads. Two of 
these loci, miR-886 and miR-1975 represent potentially in- 
correctly annotated miRNAs (44,45) as they derive from the 
longer non-coding RNAs, a vault RNA (VTRNA2-1) and a 
Y-RNA (RNY5). Interestingly, although the genomic 
location of one of the family members (let-7 1) of the 
well-known let-7 does not contain an annotated neighboring 
(300 000 nts) gene within its downstream region, a 
polyadenylation site (~40 reads in breast) that is within 
650 bases downstream of let-7 i is readily noticeable in xPAD. 

We next investigated the presence of novel polyadenylated 
genes that may be important for cancer development or pro- 
gression. To identify such novel RNA transcripts, we probed 
for novel polyadenylation locations that are at least 1 kb 
away from any known gene, is abundant (>10 reads) and is 
consistently either upregulated or downregulated in the 
majority of tissues. A total of 9612 potentially novel polya- 
denylated genes were identified, of which 77 were upregulated 
and 41 were downregulated in the majority (>3) of the tumor 
samples at more than 2-fold levels (DocS2). The DRS results 
on the differential expression of two of these locations were 
also confirmed by real-time reverse transcriptase-polymerase 
chain reaction (Figure 2C and D). These results underscore 
the notion that many additional RNAs and RNA-isoforms 
important for cancer and possibly for other diseases exist and 
that visualization tools for deep sequencing data could be 
useful in identifying such genes. 

Separate measurements of 3 f UTR isoforms reveal 
candidate APA targets 

A recent study based on microarrays and a carefully 
selected set of six genes demonstrated that short 



isoforms are more upregulated in cancer cells than their 
long isoforms for at least three genes tested, suggesting 
that the phenomenon could be widespread (5). Because 
DRS allows the separation of short and long isoforms 
readily (Figure 3A), we analysed whether we could gain 
additional insights into the increased usage of short 3' 
UTR isoforms in cancer. We first analysed whether the 
DRS reads accurately reflect genome-wide gene expression 
levels. To assess the reliability of the DRS data between 
two different experiments, the number of reads between 
each tissue type was compared. Repeatedly (>10) 
sequenced reads were highly correlated within each 
normal-tumor pair (log2 scale; Pearson correlation: 
breast, 0.86; colon, 0.89; kidney/liver, 0.84; lung, 0.80). 
As an additional genome-wide validation analysis to 
assess the overall accuracy of the data, we tested 
whether the DRS data can replicate previously reported 
patterns of relative changes between the long and short 
isoforms of genes expressed in each tissue type. We 
adopted the index (46) termed Relative Usage of Distal 
polyadenylation sites (RUD) and used the difference of 
the index between tumor and normal to monitor each 
gene's tendency to have a relatively shortened or 
lengthened 3' UTR in tumor cells (DocSl). The analysis 
indicates that in all tissue types, tumor tissues manifest a 
lower index than in the corresponding normal tissue 
(ARUD <0, Supplementary Figure S4), which is in full 
agreement with previous reports based on proliferating 
cells, cancer cell lines, or cancer tissues (4,5,38). We next 
made use of the number of reads that are directly assigned 
to short and long isoforms to evaluate the expression 
patterns of short and long 3 f UTR isoforms, separately. 
Analysis of the read-based abundance of short isoforms 
reveals that short 3 f UTR isoforms tend to manifest 
upregulation in all tumor tissues tested (Figure 3B). In 
contrast, the general expression pattern of long isoforms 
varied across tissues (Figure 3C). These results are con- 
sistent with previous reports that the preferential increase 
of short isoform levels is a general genome-wide phenom- 
enon of proliferating cells (4,5). Thus, the genome-wide 
data available via xPAD could serve useful in investigating 
changes at gene isoform-levels in cancer. 

As DRS reads provided a direct measurement of the 
levels of each individual 3' UTR isoforms, we sought to 
better define the core genetic programs affected by APA. 
We performed pathway analysis using those short 
isoforms that are preferentially more up-regulated than 
their long-isoforms in at least three tissues. Analysis 
of the 126 genes that are consistently affected by APA 
yielded top ranking pathways that are generally related 
to cell-cell (47) and/or cell-ECM contact-initiated sig- 
naling (48). Specifically, the top five statistically signifi- 
cant pathways include proteoglycan syndecan-mediated 
signaling (P = 0.01), the nectin adhesion pathway 
(P = 0.013), plasma membrane estrogen receptor signal- 
ing (P = 0.015) and the integrin signaling pathway 
(P = 0.017). Consistent with the inferred role of ECM, a 
more comprehensive network analysis of these genes im- 
plicates a large functional network containing two 
modules that are implicated in tumorigenesis 
(Supplementary Figure S5): the Rho family small 
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Figure 2. Genomic view of polyadenylated non-coding RNAs and novel gene locations that are aberrantly expressed in cancer using xPAD. (A-D) 
All polyadenylation sites detected by DRS reads (green, normal; red, tumor) are indicated for all four gene regions. (A, B) GAS5 (A) and 
TMEM191A (B) represent IncRNAs that are upregulated in the majority of tumor samples, as indicated. In contrast to the polycistronic GAS5 
which hosts multiple polyadenylated snoRNAs, polyadenylation of TMEM191A is limited to its 3 r UTR. (C and D) End locations and the 
expression levels of two potentially differentially regulated novel genes that are distantly located from known genes. Real-time PCR results also 
reveal similar expression patterns (fold change, P< 0.001); error bars represent standard deviation (n = 3). 



GTPases that receive signals from the ECM and control 
actin reorganization (49), and the RNA processing ma- 
chinery that cross-talks with Rho regulators and 
controls RNA expression, especially upregulation of Pol 
III transcripts (tRNAs, etc.) in cancer (50). Notable can- 
didate target genes from these two modules include 



POLR2K, a subunit shared by all three RNA polymerases, 
which can specifically improve the assembly of the Pol III 
pre-initiation complex (51), and the neuroepithelial cell 
transforming 1 gene NET1, which regulates multiple 
Rho GTPases and controls cell movement and tumor me- 
tastasis (52,53). Real-time PCR quantification of both 
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expression of long isoforms seems to flip between up/neutral/downregulation. 



NET1 and POLR2K transcripts across three different 
tissues also suggests that the short isoforms of these 
genes are indeed preferentially upregulated with respect 
to their long isoforms in various tumor samples 
(Supplementary Figure S6). These observations highlight 
the notion that separate analysis of gene isoforms could 
prove useful in defining new genetic signatures of cancers 
or the underlying pathways. 

Polyadenylation sites contain isoform-dependent 
regulatory marks 

The presence of differentially controlled short and long 
isoforms raises a fundamental question: how do cells dif- 
ferentiate between the two isoforms of a given gene? As 
several polyadenylation signals are known to be highly 
position specific, we tested for the presence of sequence 
elements that are located in a highly position-specific 
manner with respect to polyadenylation sites and are dif- 
ferentially used between the long and short isoforms. In 
particular, an AT-rich motif (TATATW) is highly 
preferred by short-isoforms and is position specific, 
occurring dominantly within 20nts upstream of the 
polyadenylation sites in short isoforms (Figure 4A and 
Supplementary Figure S7). Further analysis based on all 
possible hexamers indicates that its close relative AT AT AT 
is the most short-isoform-specific AT-rich hexamer 
(Supplementary Figure S8). ATATAT and its close 
variants correspond to a group of polyadenylation en- 
hancement elements that are located 30-50 nts upstream 
of the polyadenylation site in yeast (54), and are recognized 
by Hrpl, a protein involved in mRNA 3' end formation, 
surveillance, and export (55,56). Other motifs 



detected include the canonical polyadenylation signal, AA 
TAAA, which is biased towards the long isoform (9,57). 
The majority of the motifs are preferentially situated within 
±20 bases of the polyadenylation sites, indicating 
that these motifs are likely involved in positioning 
cleavage/polyadenylation factors in an isoform-dependent 
manner. 

We next sought to identify proteins that may differen- 
tially regulate short and long isoforms. Based on a large 
number (708) of publically available ENCODE ChlP-Seq 
or RIP-Seq data sets of transcription factors, histones and 
RNA binding proteins, we found that the histone mark, 
H3K36Me3 is consistently more enriched at short isoform 
polyadenylation site locations than that of long isoforms 
in all 42 datasets tested. In contrast, Pol2 occupancies at 
polyadenylation sites of long isoforms in the correspond- 
ing cell lines are higher than that of short isoforms 
(Figure 4B and Supplementary Figure S9). ENCODE 
RIP-Seq data analysis revealed that polyA-binding 
protein 1, Pabpcl (Figure 4C), could be another candidate 
that preferentially associates with polyadenylation sites of 
short-isoforms. To ensure that the observed preference is 
not simply due to the mRNA binding ability of Pabpcl, 
the Pabpcl profile was compared with that of a related 
mRNA binding protein Elavl (58), performed using iden- 
tical conditions. The comparisons reveal that the Pabpcl 
is more preferentially enriched at short isoform locations, 
suggesting that Pabpcl likely has a role in regulating the 
polyadenylation sites of short isoforms (Figure 4C). In 
conclusion, multiple regulatory marks that correspond 
to both transcriptional and post-transcriptional regulatory 
complexes are present at locations proximal to 



Nucleic Acids Research, 2012, Vol 40, No. 1 7 8467 



A 



A 




-500 0 500 -500 0 500 -500 0 500 -500 0 



B C 




Distance [nts] ► —long —short 
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Chip-Seq/RIP-Seq. 



polyadenylation sites in an isoform-dependent manner, 
providing a suggestion as to the interplay between these 
proteins and APA. 



DISCUSSION 

The biogenesis of eukaryotic RNA transcripts is made 
possible by a myriad of protein complexes that act in 
concert, orchestrating key tasks such as transcription, 
splicing, capping, 3' end cleavage and polyadenylation. 
Although gene regulation by transcription is a widely 
studied process, the effects of other terminal processes 
such as polyadenylation that regulate transcript stability, 
transport and expression of RNA transcripts are poorly 
understood. Therefore, to help with the investigations of 
polyadenylation locations, 3' UTRs, their different 
isoforms and usage, we built comprehensive maps of 
polyadenylation sites and quantified the usage of each 
polyadenylated gene isoform. The complete 
polyadenylation landscape is made publicly available 
through our web-portal, xPAD (Figure 5), which inte- 
grates the widely used UCSC genome browser (26) to 
enable detailed investigation of any genomic region by 
intuitive queries involving gene name, keywords or gene 
position. 



Important functional consequences of polyadenylation 
have emerged recently (14,15,20,59,60). In particular, the 
disparate usage of various polyadenylation sites within the 
same 3' UTR is known to result in reprograming of gene 
expression in proliferating cells (4), embryonic cells (9,21) 
and cancer cells (38). The availability of precise 
polyadenylation sites and their usage provides not only 
independent confirmation to earlier reports but also gen- 
erates additional insights into the participation of some 
functional modules that may be important targets of 
tandem APA -mediated reprogramming of gene expres- 
sion (Supplementary Figure S5). Two such modules are 
noteworthy: the Rho GTPases pathway and the RNA 
processing machinery. Rho family small GTPases are a 
class of important cellular signaling molecules with 
strong implications in cancer (61). Rather than Rho 
GTPases themselves, their regulators are emerging as the 
major targets in cancer (61), a theme that recurs in our 
network analysis, where all four Rho GTPases in the 
network are not targeted, whereas their regulators and 
effectors within the network are targeted (Supplementary 
Figure S5, Rho GTPases module). Rho GTPases are 
activated by all of these regulators in the network, which 
leads to actin re-organization, a process important for 
cancer cell migration and metastasis (49). Another import- 
ant gene in the RNA processing machinery is POLR2K, 
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the only subunit common to all three polymerases that 
dose dependency improves the assembly of the Pol III 
pre-initiation complex (51,62). As Pol III products 
(tRNAs, 5S rRNAs and 7SL rRNAs) are required for 
protein synthesis and tumors have unusually high levels 
of Pol III activity (63), it is possible that the upregulation 
of POLR2K could facilitate Pol III assembly and thus 
contribute to cell proliferation and cancer development. 

The different polyadenylated transcript isoforms could 
be regulated the transcriptional (64,65) as well as 
posttranscriptional (60) levels, and the role of these two 
processes in polyadenylation may not be readily separable 
(66). As most regulatory mechanisms that act in 3' UTRs 
are thought to be repressive and not activating (5), it is 
likely that if the causative factor is a canonical 
UTR-regulation mechanism such as that of miRNAs, 
then it must be downregulated in cancer. An alternative 
explanation is that non-canonical, isoform-dependent, 
UTR regulatory factors that do not repress, but activate 
(e.g. polyadenylation/cleavage proteins) short isoform 
levels are upregulated in cancer. Identification of potential 
isoform-dependent regulatory roles for Pabpcl, 
H3K36Me3, and the ATATAT motif that is linked to 
the yeast heterogeneous nuclear ribonucleoproteins 
(hnRNP)-like protein Hrpl highlights the possibility of 
such isoform-dependent factors. Hrpl contains two 



tandem RNA Recognition Motifs domain, an arrange- 
ment shared by various human hnRNPs. Although the 
protein sequence of Hrpl is similar to all human 
hnRNPs, the Hrpl mRNA sequence is most similar to 
hnRNPA3. Several hnRNP proteins seem upregulated in 
tumor, including one of the hnRNPA3 isoforms that is 
upregulated in four of the five tumor samples. Similarly, 
across all five samples, Pabpcl reads in tumor samples are 
higher than that of normal tissues; manifesting greater 
than 2-fold up-regulation in lung and kidney samples. 
Given the emerging theme that polyadenylation is inter- 
linked to other key processes such as transcription elong- 
ation and posttranscriptional regulation (64,67), some of 
these regulatory marks such as H3K36Me3 may play a 
dual role in controlling both transcription and 
polyadenylation of the isoforms. 

In summary, we built a comprehensive polyadenylation 
and APA map, which is made publically available as a 
webserver (johnlab.org/xpad) and as a UCSC track hub 
(http://www.johnlab.org/xpad/Hub/UCSC.txt). We have 
highlighted potential uses of the resource in studying 
polyadenylation-mediated gene regulation and generated 
new insights into the regulation of polyadenylation with 
an emphasis on gene isoform-dependent signatures in 
cancer. Although these observations will need to be 
followed up with more detailed studies, xPAD serves as 
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a unique tool to investigate various questions in biology, 
relating to polyadenylation, APA-mediated gene regula- 
tion, gene expression analysis, and discovery of new 
genes and gene isoforms. We aim to further expand 
xPAD with additional samples (tissue-types, cancer- 
subtypes, and cell lines), with the goal of defining 
gene-isoform signatures that may be useful for diagnosis 
or prognosis of specific human diseases, particularly in 
cancer (39). These studies which provide cell- type 
specific annotation and usage of 3' UTRs are expected 
to also help other genome-wide studies of UTRs such as 
miRNA target analysis (68,69), which could benefit by 
incorporating the expression levels of individual 3' UTR 
isoforms. 



SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Figures 1-9, Supplementary Methods, 
Supplementary Dataset and Supplementary References 
[70,71]. 
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